Prosper Loan Data Exploration

by Anthony Odiba

Preliminary Wrangling

This document explores the prosper Loans dataset. The data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. The data set was last updated on 03/11/2014

In this this project we will explore the characteristics of variables that can affect the loan status and to get some ideas about the relationships among multiple variables using summary statistics and data visualizations.

In [710]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import matplotlib.cm, matplotlib.colors
import seaborn as sb
import geopandas as gpd, folium, requests,fiona, branca, json
import missingno as msno
from wordcloud import WordCloud,STOPWORDS
from PIL import Image
import urllib,requests,random,palettable
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly_express as px

%matplotlib inline
%config InlineBackend.figure_format ='retina'
#from folium import features

Functions

In [390]:
#mask = np.array(Image.open(requests.get('http://pluspng.com/img-png/file-usa-map-black-png-2250.png', stream=True).raw))
#wc=loan.Occupation.value_counts().to_dict()
#wordcloud=WordCloud(width = 512, height = 512,background_color='black', stopwords=STOPWORDS,mask=mask).generate_from_frequencies(wc)
#plt.figure(figsize=(8,6),facecolor = 'white', edgecolor='blue')
##plt.tight_layout(pad=0)
#plt.imshow(wordcloud, interpolation='bilinear')
#plt.axis("off")
##plt.tight_layout(pad=0)
#plt.show()
"""There were issues with the image files, encountered a ValuError:bad transparency mask 
https://stackoverflow.com/questions/31273592/valueerror-bad-transparency-mask-when-pasting-one-image-onto-another-with-pyt"""
Out[390]:
'There were issues with the image files, encountered a ValuError:bad transparency mask \nhttps://stackoverflow.com/questions/31273592/valueerror-bad-transparency-mask-when-pasting-one-image-onto-another-with-pyt'
In [391]:
prosper=("#FE7702","#E10C79")               

#from palettable.colorbrewer.sequential import PuRd_6
#def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
#    return "hsl(0, 0%%, %d%%)" % random.randint(10, 50)
#def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
#    return tuple(PuRd_6.colors[random.randint(2,8)])

def prosper_hue(word,font_size,position,orientation,random_state=None, **kwargs):
    return np.random.choice(prosper)                
In [577]:
#cell block containing various color elements used in notebook
def set_my_style():
    sb.set()
    hue_palette=sb.color_palette(prosper_hue2)
    sb.set_palette(hue_palette)
    
prosper_hue2 =['#84807A',"#FE7702","#E10C79"]

prosper_hue3 =['#84807A',"#FE7702","#bfff00","#E10C79","#0000ff","#FFC300","#4A235A"]

#used palettable colors because it's more vibrant than the one seaborn has to offer
prosper_hue3_plasma=palettable.matplotlib.Plasma_3.hex_colors
prosper_hue5_plasma=palettable.matplotlib.Plasma_5_r.hex_colors
prosper_hue_medium=palettable.matplotlib.Inferno_8.hex_colors
prosper_hue_medium_plasma=palettable.matplotlib.Plasma_8.hex_colors
prosper_hue_medium_plasma_r=palettable.matplotlib.Plasma_8_r.hex_colors
prosper_hue_11=palettable.matplotlib.Inferno_11.hex_colors
prosper_hue_11_plasma=palettable.matplotlib.Plasma_11.hex_colors
prosper_hue_many=palettable.matplotlib.Inferno_20.hex_colors #for plots with many distinct categories
prosper_hue_many_plasma=palettable.matplotlib.Plasma_11_r.hex_colors


Explore the Dataset

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [393]:
#load the loan data into a dataset
loan=pd.read_csv('prosperLoanData.csv')
In [394]:
loan5=loan.copy()
In [395]:
loan.shape
Out[395]:
(113937, 81)
In [396]:
loan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
ListingKey                             113937 non-null object
ListingNumber                          113937 non-null int64
ListingCreationDate                    113937 non-null object
CreditGrade                            28953 non-null object
Term                                   113937 non-null int64
LoanStatus                             113937 non-null object
ClosedDate                             55089 non-null object
BorrowerAPR                            113912 non-null float64
BorrowerRate                           113937 non-null float64
LenderYield                            113937 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha)                  84853 non-null object
ProsperScore                           84853 non-null float64
ListingCategory (numeric)              113937 non-null int64
BorrowerState                          108422 non-null object
Occupation                             110349 non-null object
EmploymentStatus                       111682 non-null object
EmploymentStatusDuration               106312 non-null float64
IsBorrowerHomeowner                    113937 non-null bool
CurrentlyInGroup                       113937 non-null bool
GroupKey                               13341 non-null object
DateCreditPulled                       113937 non-null object
CreditScoreRangeLower                  113346 non-null float64
CreditScoreRangeUpper                  113346 non-null float64
FirstRecordedCreditLine                113240 non-null object
CurrentCreditLines                     106333 non-null float64
OpenCreditLines                        106333 non-null float64
TotalCreditLinespast7years             113240 non-null float64
OpenRevolvingAccounts                  113937 non-null int64
OpenRevolvingMonthlyPayment            113937 non-null float64
InquiriesLast6Months                   113240 non-null float64
TotalInquiries                         112778 non-null float64
CurrentDelinquencies                   113240 non-null float64
AmountDelinquent                       106315 non-null float64
DelinquenciesLast7Years                112947 non-null float64
PublicRecordsLast10Years               113240 non-null float64
PublicRecordsLast12Months              106333 non-null float64
RevolvingCreditBalance                 106333 non-null float64
BankcardUtilization                    106333 non-null float64
AvailableBankcardCredit                106393 non-null float64
TotalTrades                            106393 non-null float64
TradesNeverDelinquent (percentage)     106393 non-null float64
TradesOpenedLast6Months                106393 non-null float64
DebtToIncomeRatio                      105383 non-null float64
IncomeRange                            113937 non-null object
IncomeVerifiable                       113937 non-null bool
StatedMonthlyIncome                    113937 non-null float64
LoanKey                                113937 non-null object
TotalProsperLoans                      22085 non-null float64
TotalProsperPaymentsBilled             22085 non-null float64
OnTimeProsperPayments                  22085 non-null float64
ProsperPaymentsLessThanOneMonthLate    22085 non-null float64
ProsperPaymentsOneMonthPlusLate        22085 non-null float64
ProsperPrincipalBorrowed               22085 non-null float64
ProsperPrincipalOutstanding            22085 non-null float64
ScorexChangeAtTimeOfListing            18928 non-null float64
LoanCurrentDaysDelinquent              113937 non-null int64
LoanFirstDefaultedCycleNumber          16952 non-null float64
LoanMonthsSinceOrigination             113937 non-null int64
LoanNumber                             113937 non-null int64
LoanOriginalAmount                     113937 non-null int64
LoanOriginationDate                    113937 non-null object
LoanOriginationQuarter                 113937 non-null object
MemberKey                              113937 non-null object
MonthlyLoanPayment                     113937 non-null float64
LP_CustomerPayments                    113937 non-null float64
LP_CustomerPrincipalPayments           113937 non-null float64
LP_InterestandFees                     113937 non-null float64
LP_ServiceFees                         113937 non-null float64
LP_CollectionFees                      113937 non-null float64
LP_GrossPrincipalLoss                  113937 non-null float64
LP_NetPrincipalLoss                    113937 non-null float64
LP_NonPrincipalRecoverypayments        113937 non-null float64
PercentFunded                          113937 non-null float64
Recommendations                        113937 non-null int64
InvestmentFromFriendsCount             113937 non-null int64
InvestmentFromFriendsAmount            113937 non-null float64
Investors                              113937 non-null int64
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [397]:
sum(loan.duplicated())
Out[397]:
0
In [398]:
msno.bar(loan,fontsize=14,color='#84807A')
Out[398]:
<matplotlib.axes._subplots.AxesSubplot at 0x15b6f4320>
In [399]:
pd.set_option("display.max_columns", len(loan.columns))
loan.sample(5)
Out[399]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
65405 714A36010186148094AE3D2 1166277 2014-01-29 10:17:53.463000000 NaN 36 Current NaN 0.32446 0.2850 0.2750 0.25006 0.1475 0.10256 2.0 E 3.0 1 NC Other Employed 33.0 False False NaN 2014-01-29 10:17:55 640.0 659.0 2001-12-01 00:00:00 5.0 5.0 6.0 2 126.0 1.0 5.0 0.0 0.0 0.0 0.0 0.0 2941.0 0.00 0.0 6.0 1.00 0.0 0.22 $50,000-74,999 True 4166.666667 C527370522114331806B7B5 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 1 130034 4000 2014-02-04 00:00:00 Q1 2014 09FF3705275666986F7C14C 166.54 166.54 79.09 87.45 -3.07 0.0 0.0 0.0 0.0 1.0 0 0 0.0 3
98832 2EF935683775789103C3F2E 700226 2013-01-16 18:04:02.187000000 NaN 36 Current NaN 0.31790 0.2786 0.2686 0.24470 0.1325 0.11220 2.0 E 2.0 1 MA Postal Service Employed 271.0 True False NaN 2013-01-16 18:01:53 760.0 779.0 1986-06-01 00:00:00 6.0 5.0 23.0 3 472.0 0.0 1.0 1.0 3132.0 0.0 0.0 0.0 10925.0 0.45 13080.0 22.0 0.81 0.0 0.27 $50,000-74,999 True 5000.000000 7F77367271654088584202E NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 14 83293 4000 2013-01-29 00:00:00 Q1 2013 75BD3563426890262433035 165.15 2146.95 1084.94 1062.01 -38.13 0.0 0.0 0.0 0.0 1.0 0 0 0.0 77
65518 E97C35862199500931E5C96 865356 2013-08-09 17:05:28.927000000 NaN 60 Current NaN 0.31848 0.2916 0.2816 0.25660 0.1325 0.12410 2.0 E 2.0 18 NJ Other Employed 275.0 True False NaN 2013-08-09 17:05:01 680.0 699.0 1984-07-09 00:00:00 36.0 31.0 96.0 27 1034.0 2.0 17.0 0.0 0.0 0.0 0.0 0.0 35301.0 0.66 15424.0 85.0 1.00 1.0 0.25 $100,000+ True 8494.083333 9F1936903613102155F8B76 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 7 99149 2000 2013-08-16 00:00:00 Q3 2013 50F035253692736800B28B0 63.68 382.08 93.71 288.37 -9.89 0.0 0.0 0.0 0.0 1.0 0 0 0.0 14
7605 559A3516989295006204176 510576 2011-06-08 19:35:09.100000000 NaN 36 Current NaN 0.22362 0.1899 0.1799 0.17570 0.0610 0.11470 4.0 C 6.0 1 MA Computer Programmer Employed 63.0 True False NaN 2011-07-24 07:30:48 660.0 679.0 2000-07-11 00:00:00 19.0 16.0 33.0 5 201.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 6263.0 0.43 6496.0 33.0 0.81 0.0 0.37 $75,000-99,999 True 6333.333333 2F4636267776608821BBC20 2.0 53.0 53.0 0.0 0.0 6000.0 1061.46 -63.0 0 NaN 31 52292 15000 2011-08-18 00:00:00 Q3 2011 824134139671780667EC278 549.76 16492.80 11861.83 4630.97 -243.87 0.0 0.0 0.0 0.0 1.0 0 0 0.0 91
62137 0CF93586410706624BA46B8 864928 2013-08-08 16:22:30.737000000 NaN 36 Current NaN 0.27285 0.2346 0.2246 0.20640 0.1025 0.10390 3.0 D 3.0 3 IL Truck Driver Self-employed 10.0 False False NaN 2013-07-31 13:58:39 700.0 719.0 1977-08-01 00:00:00 9.0 8.0 36.0 7 133.0 0.0 4.0 0.0 0.0 14.0 1.0 0.0 3657.0 0.10 26071.0 30.0 0.67 4.0 NaN $25,000-49,999 False 3000.000000 FB5F3691024791980995CA4 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 7 99427 4000 2013-08-16 00:00:00 Q3 2013 B10A3584373960496ABC278 155.80 934.80 485.20 449.60 -19.17 0.0 0.0 0.0 0.0 1.0 0 0 0.0 44

change the datatypes of the LoanOriginationDate and the ListingCreationDate to datetime dtypes

In [400]:
print(loan['LoanOriginationDate'].dtype)
object
In [401]:
print(loan.ListingCreationDate.dtype)
object
In [402]:
loan['LoanOriginationDate']=pd.to_datetime(loan['LoanOriginationDate'])
loan['ListingCreationDate']=pd.to_datetime(loan['ListingCreationDate'])
In [403]:
assert loan.LoanOriginationDate.dtype == np.dtype('<M8[ns]') and loan.ListingCreationDate.dtype == np.dtype('<M8[ns]')
In [404]:
loan.LoanOriginationDate.min()
Out[404]:
Timestamp('2005-11-15 00:00:00')

I am interested in the Rating scores(both old and new) so I explored the relevant cloumns and formatted them as appropriate

In [405]:
#cant attach .dropna() because CreditGrade and any ProsperScore do not have any intersecting values thus error
loan.loc[:,["ProsperRating (numeric)","ProsperRating (Alpha)","ProsperScore","CreditGrade"]].sample(5)
Out[405]:
ProsperRating (numeric) ProsperRating (Alpha) ProsperScore CreditGrade
82436 NaN NaN NaN A
64995 3.0 D 6.0 NaN
95714 NaN NaN NaN B
99345 6.0 A 4.0 NaN
96680 NaN NaN NaN E

ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.

ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.

ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.

CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.

In [406]:
sorted(loan.ProsperScore.unique())
Out[406]:
[nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
In [407]:
sorted(loan['ProsperRating (numeric)'].unique())
Out[407]:
[nan, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
In [408]:
sorted(loan['ProsperRating (Alpha)'].unique()[1:],reverse=True)
Out[408]:
['HR', 'E', 'D', 'C', 'B', 'AA', 'A']
In [409]:
CG=loan['CreditGrade'].unique().tolist()
CG
Out[409]:
['C', nan, 'HR', 'AA', 'D', 'B', 'E', 'A', 'NC']
In [410]:
del CG[1]
sorted(CG,reverse=True)
Out[410]:
['NC', 'HR', 'E', 'D', 'C', 'B', 'AA', 'A']
In [411]:
loan[['ProsperScore',"ProsperRating (numeric)","ProsperRating (Alpha)"]][loan.ProsperScore== 11.0].sample(10)
Out[411]:
ProsperScore ProsperRating (numeric) ProsperRating (Alpha)
55336 11.0 7.0 AA
25866 11.0 7.0 AA
110806 11.0 7.0 AA
30273 11.0 7.0 AA
105005 11.0 6.0 A
110438 11.0 7.0 AA
109863 11.0 5.0 B
93429 11.0 7.0 AA
72818 11.0 7.0 AA
58024 11.0 6.0 A
In [412]:
# "ProsperRating (numeric)","ProsperRating (Alpha)","ProsperScore","CreditGrade" into ordered categorical types
ordinal_var_dict = {'ProsperScore': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
                    'ProsperRating (numeric)': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                    'ProsperRating (Alpha)': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
                    'CreditGrade':['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']
                   }

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    loan[var] = loan[var].astype(ordered_var)
In [413]:
loan.describe()
Out[413]:
ListingNumber Term BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ListingCategory (numeric) EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio StatedMonthlyIncome TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
count 1.139370e+05 113937.000000 113912.000000 113937.000000 113937.000000 84853.000000 84853.000000 84853.000000 113937.000000 106312.000000 113346.000000 113346.000000 106333.000000 106333.000000 113240.000000 113937.00000 113937.000000 113240.000000 112778.000000 113240.000000 106315.000000 112947.000000 113240.000000 106333.000000 1.063330e+05 106333.000000 106393.000000 106393.000000 106393.000000 106393.000000 105383.000000 1.139370e+05 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 22085.000000 18928.000000 113937.000000 16952.000000 113937.000000 113937.000000 113937.00000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000
mean 6.278857e+05 40.830248 0.218828 0.192764 0.182701 0.168661 0.080306 0.096068 2.774209 96.071582 685.567731 704.567731 10.317192 9.260164 26.754539 6.96979 398.292161 1.435085 5.584405 0.592052 984.507059 4.154984 0.312646 0.015094 1.759871e+04 0.561309 11210.225447 23.230034 0.885897 0.802327 0.275947 5.608026e+03 1.421100 22.934345 22.271949 0.613629 0.048540 8472.311961 2930.313906 -3.223214 152.816539 16.268464 31.896882 69444.474271 8337.01385 272.475783 4183.079489 3105.536588 1077.542901 -54.725641 -14.242698 700.446342 681.420499 25.142686 0.998584 0.048027 0.023460 16.550751 80.475228
std 3.280762e+05 10.436212 0.080364 0.074818 0.074516 0.068467 0.046764 0.030403 3.996797 94.480605 66.458275 66.458275 5.457866 5.022644 13.637871 4.63097 447.159711 2.437507 6.429946 1.978707 7158.270157 10.160216 0.727868 0.154092 3.293640e+04 0.317918 19818.361309 11.871311 0.148179 1.097637 0.551759 7.478497e+03 0.764042 19.249584 18.830425 2.446827 0.556285 7395.507650 3806.635075 50.063567 466.320254 9.005898 29.974184 38930.479610 6245.80058 192.697812 4790.907234 4069.527670 1183.414168 60.675425 109.232758 2388.513831 2357.167068 275.657937 0.017919 0.332353 0.232412 294.545422 103.239020
min 4.000000e+00 12.000000 0.006530 0.000000 -0.010000 -0.182700 0.004900 -0.182700 0.000000 0.000000 0.000000 19.000000 0.000000 0.000000 2.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -209.000000 0.000000 0.000000 0.000000 1.000000 1000.00000 0.000000 -2.349900 0.000000 -2.349900 -664.870000 -9274.750000 -94.200000 -954.550000 0.000000 0.700000 0.000000 0.000000 0.000000 1.000000
25% 4.009190e+05 36.000000 0.156290 0.134000 0.124200 0.115670 0.042400 0.074080 1.000000 26.000000 660.000000 679.000000 7.000000 6.000000 17.000000 4.00000 114.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.121000e+03 0.310000 880.000000 15.000000 0.820000 0.000000 0.140000 3.200333e+03 1.000000 9.000000 9.000000 0.000000 0.000000 3500.000000 0.000000 -35.000000 0.000000 9.000000 6.000000 37332.000000 4000.00000 131.620000 1005.760000 500.890000 274.870000 -73.180000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 2.000000
50% 6.005540e+05 36.000000 0.209760 0.184000 0.173000 0.161500 0.072400 0.091700 1.000000 67.000000 680.000000 699.000000 10.000000 9.000000 25.000000 6.00000 271.000000 1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.549000e+03 0.600000 4100.000000 22.000000 0.940000 0.000000 0.220000 4.666667e+03 1.000000 16.000000 15.000000 0.000000 0.000000 6000.000000 1626.550000 -3.000000 0.000000 14.000000 21.000000 68599.000000 6500.00000 217.740000 2583.830000 1587.500000 700.840100 -34.440000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 44.000000
75% 8.926340e+05 36.000000 0.283810 0.250000 0.240000 0.224300 0.112000 0.116600 3.000000 137.000000 720.000000 739.000000 13.000000 12.000000 35.000000 9.00000 525.000000 2.000000 7.000000 0.000000 0.000000 3.000000 0.000000 0.000000 1.952100e+04 0.840000 13180.000000 30.000000 1.000000 1.000000 0.320000 6.825000e+03 2.000000 33.000000 32.000000 0.000000 0.000000 11000.000000 4126.720000 25.000000 0.000000 22.000000 65.000000 101901.000000 12000.00000 371.580000 5548.400000 4000.000000 1458.540000 -13.920000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 115.000000
max 1.255725e+06 60.000000 0.512290 0.497500 0.492500 0.319900 0.366000 0.283700 20.000000 755.000000 880.000000 899.000000 59.000000 54.000000 136.000000 51.00000 14985.000000 105.000000 379.000000 83.000000 463881.000000 99.000000 38.000000 20.000000 1.435667e+06 5.950000 646285.000000 126.000000 1.000000 20.000000 10.010000 1.750003e+06 8.000000 141.000000 141.000000 42.000000 21.000000 72499.000000 23450.950000 286.000000 2704.000000 44.000000 100.000000 136486.000000 35000.00000 2251.510000 40702.390000 35000.000000 15617.030000 32.060000 0.000000 25000.000000 25000.000000 21117.900000 1.012500 39.000000 33.000000 25000.000000 1189.000000

What is the structure of your dataset?

There are 113937 unique entries in this dataset and 81 different variables. Most of the variables are numeric with a few categorical variables. The variables can be grouped into four broad categories:

  • Loan Status : The status of the loan list, such as Cancelled, Charged off, Completed, Current, Defaulted, Final Payment In Progress, Past Due.
  • Borrower Data : Basic properties about borrowers such as income, occupation, employment status, etc.
  • Loan Data : Basic properties about the loan such as length of the loan(term), Borrower APR, etc.
  • Credit Risk Metrics : Metrics measured the risk of loans, such as Credit grade, Prosper Score, bank card utilization, etc.

What is/are the main feature(s) of interest in your dataset?

Since this is a lending platform, the features of interest to me are those that helps in answering credit analysis questions. i.e determining an entity’s debt servicing capacity, or its ability to repay.

Credit Analysis is the process of drawing conclusions from available data (both quantitative and qualitative) regarding the credit – worthiness of an entity, and making recommendations regarding the perceived needs, and risks.

I am curious to know what factors affect Loan performance, therefore i will be loking at LoanStatus, Prosper Score and CreditGrade

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Technically speaking, the CreditGrade and ProsperScore are supposed to be the major determinants of the loan performance, because the credit rating/score are issued after analysis of other parameters. so firstly lets eplore the variables in our dataset that I feel will have a strong effect on how the loans will turn out, then we will see how they relate with CreditGrade and ProsperScore.

The features I fell will be exploring are ,Occupation, Location, IncomeRange, LoanStatus, LoanOriginalAmount, CurrentDelinquencies, IsBorrowerHomeowner, EmploymentStatus, BankcardUtilization, TotalProsperLoans, DebttoIncomeRatio, BorrowerAPR, BorrowerRate and finally CreditGrade and ProsperScore

Univariate Exploration

Let's Explore Borrowers by Location

In [691]:
gdf = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json")
statesabbr=pd.read_json("https://gist.githubusercontent.com/mshafrir/2646763/raw/8b0dbb93521f5d6889502305335104218454c2bf/states_titlecase.json")
In [692]:
gdf.columns
Out[692]:
Index(['id', 'name', 'density', 'geometry'], dtype='object')
In [693]:
statesabbr.columns
Out[693]:
Index(['abbreviation', 'name'], dtype='object')
In [694]:
gdf1=pd.merge(left=gdf,right=statesabbr,how='left',on=['name'])
In [695]:
gdf1.head()
Out[695]:
id name density geometry abbreviation
0 01 Alabama 94.650 POLYGON ((-87.359296 35.00118, -85.606675 34.9... AL
1 02 Alaska 1.264 (POLYGON ((-131.602021 55.117982, -131.569159 ... AK
2 04 Arizona 57.050 POLYGON ((-109.042503 37.000263, -109.04798 31... AZ
3 05 Arkansas 56.430 POLYGON ((-94.473842 36.501861, -90.152536 36.... AR
4 06 California 241.700 POLYGON ((-123.233256 42.006186, -122.378853 4... CA
In [697]:
gdf1.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 52 entries, 0 to 51
Data columns (total 5 columns):
id              52 non-null object
name            52 non-null object
density         52 non-null float64
geometry        52 non-null object
abbreviation    51 non-null object
dtypes: float64(1), object(4)
memory usage: 2.4+ KB
In [420]:
gdf1.rename(columns={'abbreviation':'BorrowerState'},inplace=True)
In [421]:
#gdf1.head()
In [422]:
loan1=loan.copy()
gdf_test_df=gdf1.copy()
In [423]:
loan_test_df=loan1.groupby('BorrowerState',as_index=False)['ListingKey'].count()
In [424]:
loan.BorrowerState.unique()
Out[424]:
array(['CO', 'GA', 'MN', 'NM', 'KS', 'CA', 'IL', 'MD', nan, 'AL', 'AZ',
       'VA', 'FL', 'PA', 'OR', 'MI', 'NY', 'LA', 'WI', 'OH', 'NC', 'WA',
       'NV', 'NJ', 'TX', 'AR', 'SC', 'DE', 'MO', 'NE', 'UT', 'DC', 'MA',
       'CT', 'IN', 'KY', 'OK', 'MS', 'WV', 'RI', 'TN', 'ID', 'MT', 'HI',
       'NH', 'VT', 'WY', 'ME', 'AK', 'IA', 'SD', 'ND'], dtype=object)
In [425]:
#loan_test_df
In [426]:
gdf_test_df=pd.merge(left=gdf_test_df,right=loan_test_df,how='left',on=['BorrowerState'])
In [427]:
gdf_test_df.head()
Out[427]:
id name density geometry BorrowerState ListingKey
0 01 Alabama 94.650 POLYGON ((-87.359296 35.00118, -85.606675 34.9... AL 1679.0
1 02 Alaska 1.264 (POLYGON ((-131.602021 55.117982, -131.569159 ... AK 200.0
2 04 Arizona 57.050 POLYGON ((-109.042503 37.000263, -109.04798 31... AZ 1901.0
3 05 Arkansas 56.430 POLYGON ((-94.473842 36.501861, -90.152536 36.... AR 855.0
4 06 California 241.700 POLYGON ((-123.233256 42.006186, -122.378853 4... CA 14717.0
In [428]:
gdf_test_df.crs = fiona.crs.from_epsg(4326)
In [429]:
gdf_test_df[gdf_test_df.name=='District of Columbia']
Out[429]:
id name density geometry BorrowerState ListingKey
8 11 District of Columbia 10065.0 POLYGON ((-77.035264 38.993869, -76.909294 38.... NaN NaN
In [433]:
gdf_test_df=gdf_test_df[gdf_test_df['name']!='District of Columbia']
In [434]:
gdf_test_df.loc[51,'ListingKey']=0
In [435]:
gdf_test_df.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 51 entries, 0 to 51
Data columns (total 6 columns):
id               51 non-null object
name             51 non-null object
density          51 non-null float64
geometry         51 non-null object
BorrowerState    51 non-null object
ListingKey       51 non-null float64
dtypes: float64(2), object(4)
memory usage: 5.3+ KB
In [690]:
gdf_test_df.id.dtype
Out[690]:
dtype('O')
In [687]:
#gdf_test_df.to_csv('prosper_loan_gdf_map.csv',index=False)
In [436]:
variable = 'ListingKey'
gdf_test_df=gdf_test_df.sort_values(by=variable, ascending=True)
colormap = folium.LinearColormap(colors=["red","orange","yellow","green"],vmin=gdf_test_df.loc[gdf_test_df[variable]>0, variable].min(), 
                                 vmax=gdf_test_df.loc[gdf_test_df[variable]>0, variable].max()).to_step(n=5)
In [712]:
centroid=gdf_test_df.geometry.centroid

m=folium.Map(location=[centroid.y.mean(), centroid.x.mean()], zoom_start=4, tiles='StamenWatercolor')

#tooltip=folium.features.Tooltip()
#folium.

folium.GeoJson(gdf_test_df[['geometry','name',variable]],
               name="United States of America",
               style_function=lambda x: {"weight":2, 'color':'black','fillColor':colormap(x['properties'][variable]), 'fillOpacity':0.2},
              highlight_function=lambda x: {'weight':3, 'color':'black'},
               smooth_factor=2.0,
               #tooltip=tooltip
              tooltip=folium.features.GeoJsonTooltip(fields=['name',variable],
                                            aliases=['State','Number of loans by state'], 
                                              labels=True, 
                                              sticky=True,
                                             )
              ).add_to(m)


colormap.add_to(m)

folium.LayerControl(autoZIndex=False, collapsed=False).add_to(m)
m
Out[712]:

From the map visualisation, we can see that most Prosper borrowers come from California.

I wanted to find out why and I did a little search and it turns out, the company was founded in California, USA in 2005, it's headquarters is also San Francisco.

I thought the situation of their offices was a possible reason but then I checked their other office location: Phoenix, Arizona and there, they had just 1901 Prosper Borrowers so i abandoned that theory.

Lastly I checked U.S.A's population distribution by state and the Prosper Borrower number mirrors that.


Lets see what listed profession are borrowing the most

In [438]:
loan.Occupation.value_counts().head()
Out[438]:
Other                  28617
Professional           13628
Computer Programmer     4478
Executive               4311
Teacher                 3759
Name: Occupation, dtype: int64
In [439]:
img_link='https://raw.githubusercontent.com/tonyodiba/Prosper_Loans_data_exploration/master/files/Usa-pfadSvgTutorial.svg.png'
In [440]:
img_link2='https://raw.githubusercontent.com/tonyodiba/Prosper_Loans_data_exploration/master/files/326-3267244_prosper-loans-on-twitter-prosper-marke.png'
In [441]:
icon=Image.open(requests.get(img_link2,stream=True).raw).convert("RGBA")
In [442]:
mask = Image.new("RGBA", icon.size, (255,255,255))
mask.paste(icon,icon)
mask = np.array(mask)
In [443]:
wc=loan.Occupation.value_counts().to_dict()
wordcloud=WordCloud(width = 400, height = 400,
                    background_color='white',max_font_size=300,
                    random_state=123,stopwords=STOPWORDS,mask=mask).generate_from_frequencies(wc)
#wordcloud.recolor(color_func=color_func, random_state=7)
wordcloud.recolor(color_func=prosper_hue,random_state=53)
plt.figure(figsize=(10,10),facecolor = 'white', edgecolor='blue')
plt.imshow(wordcloud, interpolation='bilinear')
plt.tight_layout(pad=2)
plt.axis("off")
plt.show()
In [444]:
set_my_style()
#sb.set_palette(palette=cpal)
sb.set_context('notebook')
ax=loan.Occupation.value_counts()[19::-1].plot.barh(figsize=(8,6),color=prosper_hue2[0])
plt.title('Top 20 Listed Occupations')
plt.xlabel('count')
plt.ylabel('Occupation')
plt.show()
In [445]:
set_my_style()
sb.set_context('notebook')
#prosper_gradient=palettable.cmocean.sequential.Gray_20.hex_colors
#sb.set_palette(palette=prosper_gradient)
ax=loan.Occupation.value_counts()[2:].plot.bar(figsize=(12,6))
plt.title('Listed Occupations apart from "Others" and "Professional"')
plt.ylabel('count')
plt.xlabel('Occupation')
plt.show()

From the graph, we can see that a majority of people have other Occupations that the ones listed, second most common is Professional which is a broad classification and doesnt tell us much. When we remove two highest Occupation classifications, we see that Computer Programmer is the Highest and it gradually reduces till the last which is Student-Technical School.


Which Income Bracket seems to be borrowing the Most ? what about Employment Status Group?

In [446]:
loan.IncomeRange=loan.IncomeRange.astype('category')
In [447]:
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)

plt.subplot(121)
set_my_style()
sb.set_context('notebook')
ax=loan.IncomeRange.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[1])
plt.title('Income Rate Distribution')
plt.xlabel('Income Bracket')
plt.ylabel('Count')

plt.subplot(122)
ax=loan.EmploymentStatus.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Employment Status')
#plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xlabel('Employment Status')

plt.show()

What is the Loan Amount most requested for?

In [448]:
sb.distplot(loan.LoanOriginalAmount,bins=5,color='#FE7702')
plt.title("LoanOriginalAmount showing Distribution")
plt.ylabel('Probability')
plt.show()
In [449]:
binsize=500
bins=np.arange(0,loan.LoanOriginalAmount.max()+binsize,binsize)
#fig, 
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)
loan_mean=loan.LoanOriginalAmount.mean()

ax=plt.subplot(1,2,1)
sb.distplot(loan.LoanOriginalAmount,bins=bins,kde=False)
plt.axvline(loan_mean,color='b',label='Avg. Loan Amount')
ax.text(loan_mean,-500,"{:.0f}".format(loan_mean),rotation=90,color='b')
plt.title("LoanOriginalAmount showing Count")
plt.xticks(rotation=90)
plt.ylabel('count')
plt.legend()

ax=plt.subplot(1,2,2)
sb.distplot(loan.LoanOriginalAmount,bins=bins,color='#FE7702')
plt.title("LoanOriginalAmount showing Distribution")
plt.xticks(rotation=90)
plt.ylabel('Probability')

plt.show()

We have a right skewed multi-modal distribution for the LoanOriginalAmount. The initial figure showing loanOriginalAmount distribution shows that a lot of the loans are more than \$0 and less than $10,000.

I then increased the binsize and added a line that shows the average mark for the loan amounts and we can see clearly that a lot of the loan amounts are concentrated to the left of the graph.

Also, we can now see that the are several peaks at \$5,000, \$10,000, \$15,000 and also to a lesser degree at \$20,000, and \$25,000.

Interestingly, there's a steep jump in frequency at these amounts rather than a smooth ramp up. This suggests that people are more likely to request for loan amounts that are multiples of \$5,000.


Debt to Income Ratio

In [450]:
Loan_hist=loan.copy()
In [451]:
Loan_hist.dropna(subset=['DebtToIncomeRatio'],inplace=True)
In [654]:
#loan.DebtToIncomeRatio
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.distplot(Loan_hist['DebtToIncomeRatio'],kde=False,
                color='#FE7702')
ax.set_ylabel('Count')
ax.set_title('Debt to Income Ratio')

plt.subplot(122)
ax=sb.distplot(Loan_hist['DebtToIncomeRatio'],kde=False,
                color='#FE7702')
ax.set_ylabel('Count')
ax.set_xscale('log')
ax.set_xlabel('DebtToIncomeRatio logScale')
ax.set_title('Debt to Income Ratio')

plt.show()

Most of the borrowers have a debt-to-income ratio below 1.0, and the tail gets really small past that number. I limited the x-axis so we can gain more insight from our plot; The new plot shows an outlier value near 10.

Current Delinquencies

In [453]:
loan_delinquencies=loan.copy()
In [454]:
loan_delinquencies.dropna(subset=['CurrentDelinquencies'],inplace=True)
In [455]:
#loan_delinquencies.CurrentDelinquencies
In [653]:
plt.figure(figsize=(17,4.4))
plt.subplots_adjust(wspace = 0.3)
plt.subplot(131)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False,hist=True)
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies')

plt.subplot(132)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False,hist=True)
ax.set_xscale('log')
ax.set_xlabel('CurrentDelinquenciies logScale')
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies')

plt.subplot(133)
ax=sb.distplot(loan_delinquencies['CurrentDelinquencies'],kde=False)
ax.set_xlim(0,10)
ax.set_xlabel('CurrentDelinquenciies llimited to 10')
ax.set_ylabel('Count')
ax.set_title('Current Delinquencies(x-axis limited to 0>x<10)')

plt.show()

Here most borrowers have 0 current delinquencies. I’ve limited the x-axis to 20 so that we can see the abrupt decline in current delinquencies.


Let's take a look at loan status counts

In [457]:
cpal=palettable.matplotlib.Inferno_8.hex_colors
In [458]:
loan.LoanStatus.value_counts()
Out[458]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64

I classified the LoanStatus into Performing and NonPerforming Categories.

The performing Category consists of;

  • The 'Completed' class
  • The 'Current' class (since the haven't defaulted)
  • The 'FinalPaymentInProgress' class
In [459]:
#classified the LoanStatus into performing and NonPerforming.
loan['Loan_stat1']=(loan.LoanStatus.isin(['Completed','Current','FinalPaymentInProgress'])).astype('category')
In [460]:
loan.Loan_stat1.value_counts()
Out[460]:
True     94855
False    19082
Name: Loan_stat1, dtype: int64
In [461]:
#loan_stat.Loan_stat=loan_stat.Loan_stat.astype('category')
In [462]:
loan.Loan_stat1.dtype
Out[462]:
CategoricalDtype(categories=[False, True], ordered=False)
In [463]:
#rename the newly created loan_stat category column
loan.Loan_stat1.cat.rename_categories({False:'NonPerforming',True:'Performing'},inplace=True)
In [464]:
loan.Loan_stat1.value_counts()
Out[464]:
Performing       94855
NonPerforming    19082
Name: Loan_stat1, dtype: int64
In [465]:
#tried this initially, didnt't work
#loan_stat.Loan_stat.cat.set_categories(['Performing','NonPerforming'],inplace=True)
In [466]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [467]:
#set_my_style()
plt.subplots(figsize=[16,5])
plt.subplots_adjust(wspace=0.2)
sb.set_context('notebook')


plt.subplot(121)
ax=loan.LoanStatus.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Whats the status of the loans?')
plt.xlabel('Loan Status')
plt.ylabel('Number of Listings')

plt.subplot(122)
ax=sb.countplot(x='LoanStatus',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
                order=loan.LoanStatus.value_counts().index,
                dodge=False,
                palette=['#FE7702', '#E10C79'])

ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
ax.set_ylabel('Number of Listings')

#hue_order=np.sort(loan.Loan_stat.unique())

plt.show()

We can see tha a lot of the loans are current and so far, have no issues. It was a bit difficult to compare the other loan status so I used a log scale for the other plot and seperated the Performing from the NonPerforming loans to give us a clearer picture of the loan status.

Lets define HighRisk loans as be loans that are PastDue, Chargeoff or Defaulted; Completed loans are loans in Completed, FinalPaymentInProgress and Cancelled.


Is the Borrower a Home Owner?

In [645]:
'IsBorrowerHomeowner'
ax=loan.IsBorrowerHomeowner.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Is Borrower A Home Owner')
#plt.xlabel('Loan Status')
plt.ylabel('Count')

plt.show()
In [469]:
loan.IsBorrowerHomeowner.value_counts()
Out[469]:
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64

From the above plot we can tell that the ratio of home owners to those who do not own homes is almost equal, we did a quick check to verify.


Borrower Annual Payment Rate

This is the Borrower's Annual Percentage Rate (APR) for the loan. Let's see what we can glean from plots of this data.

In [470]:
loan_BorrowerAPR=loan.copy()
In [471]:
loan_BorrowerAPR.dropna(subset=['BorrowerAPR'],inplace=True)
In [643]:
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.distplot(loan_BorrowerAPR.BorrowerAPR,kde=False,hist=True)
ax.set_ylabel('Count')
ax.set_title('Borrower Annual Payment Rate')
plt.subplot(122)
ax=sb.distplot(loan_BorrowerAPR.BorrowerAPR,kde=False,hist=True,bins=300)
ax.set_ylabel('Count')
ax.set_title('Borrower Annual Payment Rate')

plt.show()
In [473]:
loan_BorrowerAPR.BorrowerAPR.mode()
Out[473]:
0    0.35797
dtype: float64
In [634]:
loan_BorrowerAPR.BorrowerAPR.mean()
Out[634]:
0.218827655909788
In [621]:
loan_BorrowerAPR[loan_BorrowerAPR.BorrowerAPR ==(0.35356 ].sample(5)
Out[621]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months ... AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors Loan_stat1
26036 BA63359011216737325A9A8 923167 2013-09-25 18:48:20.577 NaN 36 Past Due (1-15 days) NaN 0.35356 0.3134 0.3034 0.27598 0.1625 0.11348 1.0 HR 2.0 1 CA Construction Employed 167.0 False False NaN 2013-09-25 18:48:22 640.0 659.0 2007-08-14 00:00:00 8.0 7.0 10.0 6 204.0 2.0 10.0 0.0 0.0 0.0 1.0 0.0 ... 1315.0 10.0 1.00 2.0 0.25 $25,000-49,999 True 2500.000000 67A936949465273397D9B0B NaN NaN NaN NaN NaN NaN NaN NaN 3 NaN 5 104597 4000 2013-10-07 Q4 2013 BA5F3694616367847ABBCDB 172.76 691.0400 248.22 442.8200 -13.66 -29.37 0.0 0.0 0.0 1.0 0 0 0.0 39 NonPerforming
24745 128935913253989259A7F10 935530 2013-10-03 12:20:37.767 NaN 36 Current NaN 0.35356 0.3134 0.3034 0.27598 0.1625 0.11348 1.0 HR 2.0 1 MO Food Service Employed 2.0 True False NaN 2013-10-03 12:20:39 640.0 659.0 1993-05-25 00:00:00 6.0 6.0 24.0 5 305.0 0.0 3.0 1.0 20647.0 4.0 1.0 0.0 ... 170.0 22.0 0.81 0.0 0.53 $1-24,999 True 1250.000000 D46836958908315383ED6EE NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 5 104966 2000 2013-10-08 Q4 2013 244436941018269780C6474 86.38 343.8027 139.75 204.0527 -6.57 0.00 0.0 0.0 0.0 1.0 0 0 0.0 1 Performing
19240 F988358743150087646642F 880959 2013-08-27 09:34:27.770 NaN 36 Past Due (61-90 days) NaN 0.35356 0.3134 0.3034 0.27600 0.1575 0.11850 1.0 HR 3.0 1 WA Other Employed 208.0 True False NaN 2013-08-27 09:34:06 680.0 699.0 1992-10-27 00:00:00 7.0 6.0 25.0 3 102.0 4.0 6.0 0.0 0.0 0.0 0.0 0.0 ... 3058.0 20.0 1.00 1.0 5.55 $1-24,999 True 125.000000 98163692224514203F77A0F 1.0 9.0 9.0 0.0 0.0 7500.0 6868.47 -51.0 71 NaN 7 100223 4000 2013-08-29 Q3 2013 E35D3561595454326C4EA45 172.76 518.2800 207.68 310.6000 -9.91 0.00 0.0 0.0 0.0 1.0 0 0 0.0 64 NonPerforming
35622 FD513588992140499A1F62A 916954 2013-09-21 10:17:48.427 NaN 36 Current NaN 0.35356 0.3134 0.3034 0.27598 0.1625 0.11348 1.0 HR 3.0 7 OH Other Employed 147.0 False False NaN 2013-09-21 10:17:50 660.0 679.0 1991-06-17 00:00:00 4.0 4.0 39.0 3 161.0 1.0 3.0 0.0 0.0 7.0 1.0 0.0 ... 375.0 35.0 0.77 1.0 0.15 $25,000-49,999 True 3333.333333 DF3E369322127877862AD37 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 6 102432 4000 2013-09-24 Q3 2013 56B53692535892003D0E241 172.76 863.8000 356.88 506.9200 -16.18 0.00 0.0 0.0 0.0 1.0 0 0 0.0 18 Performing
87582 70DF35898308441226467E1 933968 2013-09-27 16:34:25.983 NaN 36 Current NaN 0.35356 0.3134 0.3034 0.27598 0.1625 0.11348 1.0 HR 3.0 1 AR Nurse (LPN) Employed 97.0 False False NaN 2013-09-27 16:34:28 660.0 679.0 2007-05-02 00:00:00 16.0 15.0 21.0 13 374.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 ... 2443.0 17.0 1.00 2.0 0.43 $25,000-49,999 True 2833.333333 38A736941717705705CA4D2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 5 103659 4000 2013-10-02 Q4 2013 166D3694085509108481543 172.76 863.8000 362.81 500.9900 -15.98 0.00 0.0 0.0 0.0 1.0 0 0 0.0 26 Performing

5 rows × 82 columns

A lot of the BorrowerAPR occurs at 0.35797 i tried to take a look manually to see if i could determine why but i was unsucessful, I will explore this further later.

ProsperScore and CreditGrade

In [480]:
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.countplot(x='ProsperScore',data=loan,alpha=0.9,palette=[prosper_hue2[0]])
plt
plt.subplot(122)
ax=sb.countplot(x='CreditGrade',data=loan,alpha=0.9,
                palette=[prosper_hue2[0]])

‘CreditGrade’ is the pre-2009 rating, and ‘ProsperRating’ is the more sophisticated post-2009 rating. Also, ‘CreditGrade’ has around 29,000 values and ‘ProsperRating’ has a little more than 85,000 values. Incomparison to overall count, there are less High rating scores post 2009 than there were in pre-2009. maybe it is because after the company restructured, they implemented a stricter rating system.

Let's do a little data formating of the columns

In this section, I renamed the values in some columns hereby putting some of them into broader classifications, I then converted these columns values to categorical values.

In [481]:
#change LoanStatus to category type
loan.LoanStatus=loan.LoanStatus.astype('category')
In [482]:
#make new loanStatus columns
loan['Loan_stat2']=loan['LoanStatus']
loan['Loan_risk']=loan['LoanStatus']
In [483]:
#loan.Loan_stat2.dtype
In [484]:
#loan.Loan_risk.dtype
In [485]:
loan.LoanStatus.value_counts()
Out[485]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
In [486]:
clean_Loan_stat_risk={'Loan_stat2':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
                                    'Past Due (1-15 days)': 'PastDue','Past Due (16-30 days)':'PastDue',
                                    'Past Due (31-60 days)':'PastDue','Past Due (61-90 days)':'PastDue',
                                    'Past Due (91-120 days)':'PastDue','Past Due (>120 days)':'PastDue'},
                      
                      'Loan_risk':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
                                   'Past Due (1-15 days)': 'HighRisk','Past Due (16-30 days)':'HighRisk',
                                   'Past Due (31-60 days)':'HighRisk','Past Due (61-90 days)':'HighRisk',
                                   'Past Due (91-120 days)':'HighRisk','Past Due (>120 days)':'HighRisk',
                                   'Defaulted':'HighRisk','Chargedoff':'HighRisk'}}
In [487]:
#use replace to change categorys values
loan.replace(clean_Loan_stat_risk,inplace=True)
In [488]:
loan.Loan_stat2.value_counts()
Out[488]:
Current       56576
Completed     38284
Chargedoff    11992
Defaulted      5018
PastDue        2067
Name: Loan_stat2, dtype: int64
In [489]:
loan.Loan_risk.value_counts()
Out[489]:
Current      56576
Completed    38284
HighRisk     19077
Name: Loan_risk, dtype: int64
In [490]:
loan.Loan_stat2=loan.Loan_stat2.astype('category')
loan.Loan_risk=loan.Loan_risk.astype('category')
In [491]:
loan.Loan_stat2.dtype
Out[491]:
CategoricalDtype(categories=['Chargedoff', 'Completed', 'Current', 'Defaulted',
                  'PastDue'],
                 ordered=False)
In [492]:
loan.Loan_risk.dtype
Out[492]:
CategoricalDtype(categories=['Completed', 'Current', 'HighRisk'], ordered=False)
In [493]:
#loan.LoanOriginationQuarter.value_counts()
#I would have loved to convert this variable in order to explore it but i'm out of time
In [494]:
loan.LoanOriginationQuarter.dtype
Out[494]:
dtype('O')

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

CA has the highest volume of loans and almost double the next following states TX and NY.

The BorrowerAPR is a left-skewed normal distribution and it has a spike at ~0.358. 'DebtToIncomeRatio' as well as 'CurrentDelinquencies' both had a large range of values, so i set the x-axis to a log scale and it shows that most of their values were at 1 and below 10 respectively. i.e both are right-skewed.

We have a right skewed multi-modal disistribution for the LoanOriginalAmount. The initial figure showing loanOriginalAmount distribution shows that a lot of the loans are more than \$0 and \$10,000. I then increased the binsize and added a line that shows the average mark for the loan amounts and we can see clearly that a lot of the loan amounts are concenterated to the left of the graph. Also, we can now see that the are several peaks at \$5,000, \$10,000, \$15,000 and also to a lesser degree at \$20,000, and \$25,000. Interestingly, there's a steep jump in frequency at these amounts rather than a smooth ramp up. This suggests that people are more likely to request for loan amounts that are multiples of \$5,000.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For LoanStatus, for future visualizations, I changed all the different past due values and i changed the column dtype to categorical.

Bivariate Exploration

In [495]:
#Remove outstanding loans

loan_historical = loan[loan["Loan_risk"] != "Current"]
#df_historical["LoanStatus"].value_counts()
In [496]:
plt.figure(figsize=(15,5))

plt.subplot(121)
ax=sb.stripplot(data=loan,x='Loan_risk',y='LoanOriginalAmount',
                edgecolor='W',linewidth=0.5,size=8,palette=prosper_hue3_plasma,
                jitter=True,dodge=True)
ax.set_xlabel('Loan Status')
ax.set_ylabel(' Original Loan Amount')
ax.set_title('Loan Amount vs RIsk')

plt.subplot(122)
ax=sb.stripplot(data=loan,palette=prosper_hue_11_plasma,x='ProsperScore',y='LoanOriginalAmount',
                edgecolor='W',linewidth=0.4,size=8,
                jitter=True,dodge=True)
ax.set_xlabel('Prosper Score')
ax.set_ylabel('Original Loan Amount')
ax.set_title('Loan Amount vs ProsperScore')
plt.show()
In [497]:
#from palettable.colorbrewer.sequential import Blues_8
sb.set_context('notebook')
plt.figure(figsize=(18,6))
ax=sb.countplot(x='LoanStatus',data=loan,hue='IncomeRange',alpha=0.9,palette=prosper_hue_medium_plasma)
ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
plt.title('LoanStatus Vs IncomeRange')
plt.show()
In [498]:
#from palettable.colorbrewer.sequential import Blues_8
sb.set_context('notebook')
plt.figure(figsize=(18,6))
ax=sb.countplot(x='LoanStatus',data=loan,hue='ProsperScore',alpha=0.9,palette=prosper_hue_11_plasma)
ax.legend(loc='upper right',prop={'size':12})
plt.xticks(rotation=90)
ax.set_yscale('log')
plt.title('LoanStatus Vs ProsperScore')
plt.show()
In [499]:
#loan['ProsperRating (numeric)']=loan['ProsperRating (numeric)'].astype(int)
In [500]:
#loan_test_df.sort_values(by='ListingKey')
In [642]:
#created a df for loan AailableBankCreditCard with out the outlier values
loan_ABCC_less_90=loan[loan.AvailableBankcardCredit<90000]

#plotted regression for BorrowerRate vs DebtTpIncomeRatio and BorrowerRate Vs AvailableBankCreditCard
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)

plt.subplot(121)
ax=sb.regplot(data=loan,x='DebtToIncomeRatio',y='BorrowerAPR')
plt.subplot(122)
ax=sb.regplot(data=loan_ABCC_less_90,y='BorrowerAPR',x='AvailableBankcardCredit')
#ax.set_ylim(0,0.4)
plt.show()
In [501]:
binsize=500
bins=np.arange(0,loan.LoanOriginalAmount.max()+binsize,binsize)
#fig,
plt.figure(figsize = [17, 5])
plt.subplots_adjust(wspace = 0.3)
sb.set_palette(prosper_hue2)

plt.subplot(121)
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='CA'],kde=False,label='Califonia')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='TX'],kde=False,label='Texas')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='NY'],kde=False,label='NewYork')
ax=sb.distplot(loan.LoanOriginalAmount[loan.BorrowerState=='FL'],kde=False,label='Florida')
ax.legend()
plt.title("Origial Loan Amount for Top 4 states")
#plt.xticks(rotation=90)
plt.ylabel('count')
#ax.set_yscale('log')
#sb.distplot(loan.LoanOriginalAmount,bins=bins,color='#FE7702')

plt.subplot(122)
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='ND'],shade=True,label='North Dakota')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='ME'],shade=True,label='Maine')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='WY'],shade=True,label='Wyoming')
ax=sb.kdeplot(loan.LoanOriginalAmount[loan.BorrowerState=='IA'],shade=True,label='Iowa')
ax.legend()
plt.xlabel('Original Loan Amount')
plt.title("Origial Loan Amount for bottom 4 states")

plt.show()
In [502]:
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.countplot(x='ProsperScore',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
                palette=['#FE7702', '#E10C79'])
plt.subplot(122)
ax=sb.countplot(x='CreditGrade',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
                palette=['#FE7702', '#E10C79'])

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Graph of Prosper Score vs LoanStatus shows that it has a left-skewed distribution for completed loan, which means completed loans have primarily good ratings. However, Prosper score distributed a bell-shaped in high risk loan.

This is particularly observed in past-Due (32-60 days). Compared to Prosper Rating with a left-skewed shape, it seems like Prosper Score presents a less ability to detect the high risk loans.

From the plot of both ProsperScore and CcreditGrade vs LoanStatus(i.e Performing and NonPerforming Loans), I observed that Prosper Score was wayy more efficient at determining the performance of a loan than creditGrade was. This indicates that the risk management system at Prosper is efficient.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

BorrowerAPR seems to be positively correlated with DebtToIncomeRatio and negatively correlated with AvailableBankCardCredit. It seems that a higher available bank card credit will lower your interest rate. I theorize that when a borrower has relatively low credit amount the probabilities of getting low and high interest rates are similar, but with a high available bank credit, an individual is more likely to get a lower interest rate.

Multivariate Exploration

In [706]:
# boxplot of 'Loan Amount vs Income Range vs LoanStatus'
plt.figure(figsize=(16,7))
ax=sb.boxplot(data=loan,hue=loan.Loan_stat1,x=loan.IncomeRange,y=loan.LoanOriginalAmount,dodge=True)
plt.xticks(rotation=90)
ax.set_title('Loan Amount vs Income Range vs LoanStatus')

plt.show()
In [504]:
#plt.figure(figsize=(22,22))
g = sb.PairGrid(loan, vars=['TotalProsperLoans', 'LoanOriginalAmount', 'AmountDelinquent', 'DebtToIncomeRatio'],
                 hue='EmploymentStatus', palette=prosper_hue_medium_plasma)
g.map(plt.scatter, alpha=0.8)
g.add_legend();
plt.show()
In [709]:
#plt.figure(figsize=(15,7))
sb.set_context('notebook')
sb.lmplot(data=loan,size=7,aspect=2,x='DebtToIncomeRatio',y='BorrowerAPR',hue='CreditGrade',
               fit_reg=False,palette=prosper_hue_medium_plasma_r)
plt.show()
In [506]:
init_notebook_mode(connected=True)
In [679]:
loan_no_nan=loan.copy()
loan_no_nan2=loan.copy()
In [680]:
loan_no_nan.dropna(subset=['ProsperScore'],inplace=True)
loan_no_nan2.dropna(subset=['CreditGrade'],inplace=True)
In [682]:
px.scatter(loan_no_nan2,trendline='ols',x='LoanOriginalAmount',
           y='AmountDelinquent',color='CreditGrade',
           log_x=False,facet_col='Loan_risk',
           log_y=True)
In [668]:
px.scatter(loan_no_nan,trendline='ols',trendline_color_override='Black',x='LoanOriginalAmount',
           y='StatedMonthlyIncome',color='ProsperScore',log_x=True,facet_col='Loan_risk',
           log_y=True)

In [548]:
loan_Borrower=loan.copy()
loan_Borrower2=loan.copy()
In [549]:
loan_Borrower.dropna(subset=['CreditGrade'],inplace=True)
loan_Borrower2.dropna(subset=['ProsperScore'],inplace=True)
In [517]:
px.box(loan_no_nan,x='ProsperScore',color='IsBorrowerHomeowner',
           y='BorrowerAPR',log_x=False,facet_col='Loan_stat1',
           log_y=False)

In [518]:
loan_parallel=loan.copy()
In [519]:
loan_parallel['Par_col']=loan_parallel['Loan_risk']
loan_parallel['Par_col2']=loan_parallel['Loan_stat2']
In [520]:
clean_Loan_risk={'Par_col':{'Completed':1,'Current': 2,'HighRisk':3},
                 'Par_col2':{'Completed':1,'Current':2, 'PastDue':3, 'Defaulted':4, 'Chargedoff':5}}
In [521]:
loan_parallel.replace(clean_Loan_risk,inplace=True)
In [522]:
loan_parallel.Par_col.dtype
Out[522]:
dtype('int64')
In [523]:
#loan_parallel_no_na=loan_parallel.dropna(subset=['ProsperScore'])
#loan_parallel_no_na.info()
In [685]:
init_notebook_mode(connected=True)

df=loan_parallel
data = [
    go.Parcoords(
        line = dict(color = df['Par_col'],
                   colorscale = [[0,'#84807A'],[0.5,"#FE7702"],[1,'#E10C79']]),
        #line = dict(color = df['colorVal'],
        #           colorscale = 'Jet',
        #           showscale = True,
        #           reversescale = True,
        #           cmin = -4000,
        #           cmax = -100),
        dimensions = list([
            dict(range = [100,35000],
                 constraintrange=[20000,25000],
                 label = 'Original Loan Amount', values = df['LoanOriginalAmount']),
            dict(tickvals = [1,2,3,4,5],
                 constraintrange=[4,5],
                 ticktext = ['Completed', 'Current', 'PastDue', 'Defaulted', 'Chargedoff'],
                 label = 'Loan Status', values = df['Par_col2']),
            dict(range=[0.0,10.01],
                 label = 'DebtToIncomeRatio', values = df['DebtToIncomeRatio']),
            dict(range = [0.0,175500],#monthly income range too high 0.0-1755000 reduce it
                 visible = True,
                 label = 'Monthly Income(Stated)', values = df['StatedMonthlyIncome']),
            dict(range = [0,40],
                 label = 'Recommendations', values = df['Recommendations']),
            dict(range = [1,1189],
                 label = 'Investors', values = df['Investors']),
            dict(range = [0,8],
                 label = 'TotalProsperLoans', values = df['TotalProsperLoans']),
            dict(range = [-1,12],
                 constraintrange=[9,11],
                 tickvals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
                 label = 'Prosper Score', values = df['ProsperScore'])
            #[0.0,646285.0] also add incomerange
        ])
    )
]

layout = go.Layout(
    title='Parallel plot showing interaction of 9 different variables',
    plot_bgcolor = '#E5E5E5',
    paper_bgcolor = '#E5E5E5',
)


fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig,filename='Prosper_loans_parallel')
#plotly.offline.plot(fig, filename = 'Proser_loans.html')
In [686]:
#Occupation_group.dropna(subset=['Prosper_score_clean'],inplace=True)

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

From the BoxPlot of the LoanOriginal amount vs Income Range, when grouped by performing vs Non-Performing loans, we observe that the two categories where non-performing loans is higher than performing loans are in the \$0 income range and the non employed.

Borrower ARP decreases with improving Prosper Score. For performing loans, it was observed that prosperScore of >10 enjoy the least BorrowerAPR, with a relatively short range of about 0.05-0.14, and it seems Home owners enjoy an even smaller BorrowerARP compared to non-home owners.

For the non-performing loan group, from a prosperScore of 0-5, the borrower Annual Percentage Rate is about the same with variation in range, however from the rating score of >6, it decreases with improving score. Also here, the BorrowerAPR for home owners appears to be lower than for non-home owners.

From the plot of StatedMonthlyIncome vs LoanOriginalAmount, we can see that the ordinary leeast square regression line shows an upward trend, meaning that as statedMonthly income increased, the loan amoint increased. this trend is more apparent in the prosperScore group of 10.0

For BorrowerAPR vs DebttoIncomeRatio, firstly, most people tend to have debt-to-income ratios below 1, regardless of risk category. Iobserved alos that AA category loans seem to have a lower BorrowerAPRs and a smaller range of debt-to-income ratios, both of which indicate less risk and APR increases as the rating gets riskier. Most people tend to have debt-to-income ratios below 1, regardless of risk category.

Were there any interesting or surprising interactions between features?

From the 'Parallel plot showing interaction of 9 different variables' I observed that they were a lot of loans which had a prosperScore of 9,10 and 11 that were either chargedoff or defaulted,some of these loans with original loan amount between \$20,000 and \$25,000 had monthlyIncome Amounts close to the borrowed amounts.